Video-Text Pre-training with Learned Regions for Retrieval

Abstract

Video-text pre-training aims at learning transferable representations from large-scale video-text pairs by aligning the semantics between visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate directly at the frame level and thus overlook the spatio-temporal structure of objects in video, which has a strong synergy with the nouns in textual descriptions. In this work, we propose a simple yet effective module for video representation learning, namely RegionLearner, which can take the structure of objects into account during pre-training on large-scale video-text pairs. Given a video, our module (1) first quantizes continuous visual features by clustering patch features into the same cluster according to content similarity, then (2) generates learnable masks to aggregate fragmentary features into regions with complete semantics, and finally (3) models the spatio-temporal dependencies between different semantic regions. In contrast to using off-the-shelf object detectors, the proposed module does not require explicit supervision and is much more computationally efficient. We pre-train the proposed approach on the public WebVid2M and CC3M datasets. Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of RegionLearner.
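The three steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`quantize`, `aggregate`, `interact`), the nearest-codebook-entry clustering, the softmax masks, and the projection-free self-attention are all assumptions chosen for brevity, and the codebook and mask logits are random here, whereas in training they would be learned.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def quantize(patches, codebook):
    """Step 1 (assumed form): snap each patch feature to its nearest codebook entry."""
    ids = []
    for p in patches:
        dists = [sum((x - c) ** 2 for x, c in zip(p, code)) for code in codebook]
        ids.append(dists.index(min(dists)))
    return ids, [codebook[i] for i in ids]


def aggregate(quantized, mask_logits):
    """Step 2 (assumed form): each row of mask_logits is one mask over all patches.

    A region feature is the mask-weighted sum of the quantized patch features.
    """
    dim = len(quantized[0])
    regions = []
    for logits in mask_logits:
        w = softmax(logits)
        regions.append(
            [sum(w[n] * quantized[n][d] for n in range(len(quantized))) for d in range(dim)]
        )
    return regions


def interact(regions):
    """Step 3 (assumed form): single-head self-attention across regions, no projections."""
    scale = math.sqrt(len(regions[0]))
    out = []
    for q in regions:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in regions]
        attn = softmax(scores)
        out.append([sum(a * v[d] for a, v in zip(attn, regions)) for d in range(len(q))])
    return out


# Hypothetical sizes: 6 patch features of dimension 4, 3 codebook entries, 2 regions.
random.seed(0)
DIM, N_PATCHES, N_CODES, N_REGIONS = 4, 6, 3, 2
patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PATCHES)]
codebook = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_CODES)]
mask_logits = [[random.gauss(0, 1) for _ in range(N_PATCHES)] for _ in range(N_REGIONS)]

ids, quantized = quantize(patches, codebook)
regions = interact(aggregate(quantized, mask_logits))
```

In this sketch the patches would come from all frames of a clip, so the masks and the attention step span both space and time; how the paper itself arranges patches, masks, and attention layers is not stated in the abstract.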

Related resources

Learned Lexicon-Driven Interactive Video Retrieval

In this paper, we combine automatic learning of a large lexicon of semantic concepts with traditional video retrieval methods into a novel approach to narrow the semantic gap. The core of the proposed solution is formed by the automatic detection of an unprecedented lexicon of 101 concepts. From there, we explore the combination of query-by-concept, query-by-example, query-by-keyword, and user in...

Automatic text regions location in video frames

Content-based information retrieval from digital video databases and media archives is a challenging problem and is rapidly gaining widespread research and commercial interest. For reliable retrieval and intelligent access to video programs, indexing should provide semantic descriptors. One way to include more semantic knowledge in the indexing process is to use the text embedded within ima...

A Pre-viewing Step in Video Retrieval

Video files are very complex objects. For many years, researchers have developed models to support search-and-retrieval systems specific to these objects. Since the results of a query will be a set of videos or of segments of videos, their size may be prohibitive and may not allow for pre-validation before downloading. Moreover, many features of the video files, for example the multiplicity of the...

Video Information Retrieval: Lessons Learned with the Informedia Digital Video Library

Video contains multiple types of audio and visual information, which are difficult to extract, combine, or trade off in general video information retrieval. This paper provides an evaluation of the effects of different types of information used for video retrieval from a video collection. A number of different sources of information are present in most typical broadcast video collections and can...

Fast Video Retrieval under Sparse Training Data

Feature selection for video retrieval applications is impractical with existing techniques, because of their high time complexity and their failure on the relatively sparse training data that is available given video data size. In this paper we present a novel heuristic method for selecting image features for video, called the Complement Sort-Merge Tree (CSMT). It combines the virtues of a wrap...


Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i3.25414